About the Data

This is data that the city of Boulder keeps open to the public. They maintain ~30 data sets for various different areas of interest such as crime, homelessness, and city affairs. This analysis is leveraging the data set, 2019 Council Emails Dataset:

Data set containing 2019 emails to . This version works best for viewing in Excel, as it includes only the plain text version of the emails. This file is updated daily with new emails.

raw_data <- read_csv("http://www-static.bouldercolorado.gov/docs/opendata/CouncilEmails_PlainText2019.csv")
## Parsed with column specification:
## cols(
##   SentFrom = col_character(),
##   SentTo = col_character(),
##   SentCC = col_character(),
##   ReceivedDate = col_character(),
##   EmailSubject = col_character(),
##   PlainTextBody = col_character(),
##   MessageIdentifier = col_character()
## )
# raw_data %>% 
#   datatable()

Understanding the Data

This data set has the following columns : SentFrom, SentTo, SentCC, ReceivedDate, EmailSubject, PlainTextBody, MessageIdentifier. They are relatively self explanatory. One item to note is that all emails included in this data set are sent to . They may also be sent to others such as boulderplanningboard or specific city council members.

Data Cleaning

These data are a little messy. Not bad in the grand scheme of data messiness but we need to remove some junk if we want any reasonable results. I have created a table of what I removed and why:

removed <- 
  list()
removed$no_reply <- 
  raw_data %>% 
  filter(SentFrom == "No Reply") # these are spam

removed_rows <- 
  bind_rows(removed, .id = "removal_reason")

city_council_members <- 
  c("Jones, Suzanne",
    "Weaver,  Sam",
    "Brockett, Aaron",
    "Yates,  Phillip",
    "Carlisle, Cynthia",
    "Grano, Jill",
    "Morzel, Lisa",
    "Young,  Mary",
    "Nagle, Mirabai")

city_officials <- 
  c("Aulabaugh, Shannon", # police spokesperson
    "Brautigam, Jane") #city manager 

data_clean <- 
  raw_data %>% 
  # remove bad data
  anti_join(removed_rows %>% 
              select(-removal_reason)) %>% 
  # create some flag variables
  mutate(IsMasked = ifelse(str_detect(SentFrom, "\\[*\\]"), T, F),
         IsReply  = ifelse(str_detect(EmailSubject, "^R[E,e]:"), T, F),
         FromCCM  = ifelse(str_detect(SentFrom, 
                                      pattern = str_c(city_council_members, collapse = "|")),
                           T,
                           F),
         ToCCM  = ifelse(str_detect(SentTo, 
                             pattern =  str_c(city_council_members, collapse = "|")),
                           T,
                           F),
         FromCO = ifelse(str_detect(SentFrom, 
                             pattern =  str_c(city_officials, collapse = "|")),
                           T,
                           F)) %>% 
  # format strings
  mutate_at(vars(EmailSubject), str_to_lower) %>% 
  mutate(ReceivedDate = as_date(ReceivedDate)) 
## Joining, by = c("SentFrom", "SentTo", "SentCC", "ReceivedDate", "EmailSubject", "PlainTextBody", "MessageIdentifier")
removed_rows %>% 
  group_by(removal_reason) %>% 
  summarise(n_rows_removed = n()) %>% 
  ungroup() %>% 
  format_table()
Removal reason N rows removed
no_reply 302

Metadata

The meat of this data set is the plain email text body. We’ll get to that later. First, let’s look at the meta data about whose sending the emails and when they are sending them.

Senders

My hypothesis is that there aren’t too many unique senders. Unfortunately public engagement is low in general. As expected, we see a small handful having set the majority of emails.

frequency_email <- data_clean %>% 
  count(SentFrom, FromCCM, FromCO) %>% 
  mutate(FromCCM = case_when(FromCCM ~ "City Council Member", 
                             FromCO  ~ "City Official",
                             T       ~ "Constituent")) %>% 
  arrange(desc(n)) %>% 
  mutate(SentFrom = as_factor(SentFrom)) 

frequency_email %>% 
  plot_ly(x = ~SentFrom, y = ~n, color = ~FromCCM, type = "bar")

If we look at some descriptive statistics about whose sending emails and their frequency, we get the following:

frequency_email %>% 
  summarise(total_observations = sum(n),
            mean   = mean(n),
            median = median(n),
            mode   = getmode(n),
            max    = max(n),
            min    = min(n)) %>% 
 format_table()
Total observations Mean Median Mode Max Min
5180 2.060461 1 1 109 1

As expected, this is a very skewed distribution with most only sending one email and one individual having sent 109.

Dates of Emails

Looking at the dates that emails get sent, it appears that there are outside triggers that incent people to send emails to the city council on certain days. Without looking at the calandar for city council sessions, it is a good guess that the days with the most email traffic are days prior and post city council sessions.

data_clean %>% 
  count(ReceivedDate) %>% 
  plot_ly(x = ~ReceivedDate, y = ~n, type = 'scatter', mode = 'lines')

Coucil and City Officials Engagement

Based on reply signs in the email subject, we can also see how engaged the council or city officials are with constituents:

data_clean %>% 
  count(IsReply) %>% 
  filter(!is.na(IsReply)) %>% 
  mutate(IsReply = ifelse(IsReply, "Response Email", "Original Email")) %>% 
  plot_ly(x = ~IsReply, y = ~n, type = "bar")

This is a bit low, I would expect more engagement since most emails are relatively unique and not campaigns by interest groups. Let’s see what types of trends we can find based on when city officials reply.

Email Duplicates

People frequently copy paste an email in order to show more public awareness to an issue. We are checking by email subject for now. As expected, there are some movements in Boulder that coordinate the efforts through this email channel.

data_clean %>% 
  filter(!IsReply) %>% 
  count(EmailSubject) %>% 
  arrange(desc(n)) %>% 
  mutate(EmailSubject = as_factor(EmailSubject)) %>% 
  plot_ly(x = ~EmailSubject, y = ~n, type = "bar")

We are starting to see some trends here. This is not conclusive, but gives us a good starting point for further analysis. There are a couple of notable trends. Mostly surounding Boulder’s Open Space Mountain Parks, Bike Trails, Vaping, Homeless Shelters, and 5g:

data_clean %>% 
  filter(!IsReply) %>% 
  count(EmailSubject) %>% 
  arrange(desc(n)) %>% 
  head(10) %>% 
  format_table()
Emailsubject N
please protect boulder open space 72
help us curb youth tobacco and nicotine use in boulder 63
vaping 20
balance is needed on osbt! 19
vaping ordinance 19
long family farm 17
alpine balsam 16
alpine-balsam 16
marpa house 16
prairie dogs 16

This is a good start. Let’s first bin the emails into categories by extracting common phrases

Find most commons words and phrases

tidy_text <- data_clean %>% 
  unnest_tokens(word, EmailSubject) %>% 
  anti_join(stop_words, by = "word")
  

tidy_text %>% 
  count(word) %>% 
  filter(!(word %in% c("boulder", "council", "city", 'fw', "fwd", 2019))) %>% # remove some meaningless words
  with(wordcloud(word, n, max.words = 50))
## Warning in wordcloud(word, n, max.words = 50): protect could not be fit on
## page. It will not be plotted.

tidy_phrases <- 
  data_clean %>% 
  unnest_tokens(word, PlainTextBody, token = "ngrams", n = 2) %>% 
  separate(word, into = c("w_1", "w_2"), remove = F) %>% 
  anti_join(stop_words, by = c("w_1" = "word")) %>% 
  anti_join(stop_words, by = c("w_2" = "word")) %>% 
  count(word) %>% 
  arrange(desc(n))
## Warning: Expected 2 pieces. Additional pieces discarded in 57337 rows [2,
## 3, 10, 11, 95, 96, 110, 111, 112, 113, 126, 127, 130, 131, 213, 214, 304,
## 305, 313, 314, ...].
tidy_phrases
## # A tibble: 145,898 x 2
##    word                                                                   n
##    <chr>                                                              <int>
##  1 city council                                                        4063
##  2 boulder city                                                        1492
##  3 boulder county                                                      1155
##  4 http tinyurl.com                                                     883
##  5 council 6bb18bf32d5a8f2798e95f9c79a8dbd48d5a1350ba840a320e2eb050e…   866
##  6 police department                                                    811
##  7 boulder police                                                       761
##  8 dear council                                                         737
##  9 prairie dogs                                                         727
## 10 south boulder                                                        723
## # … with 145,888 more rows

Data Analysis

Open Space

Bike Trails

Vaping

Boulder Homeless Shelter

5G Network